[Figure: histograms of Block.0.query, Block.3.query, and Block.6.query for (a) Full-Precision and (b) Q-ViT.]
FIGURE 2.3
The histogram of query values q (shaded) along with the PDF curve of the Gaussian distribution N(μ, σ²) [195], for three selected layers in DeiT-T and the 4-bit fully quantized DeiT-T (baseline). μ and σ² are the statistical mean and variance of the values.
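As a point of reference, per-block query statistics like those visualized in Fig. 2.3 can be collected with forward hooks, as in the short sketch below. The attribute path blocks[i].attn.qkv assumes the common timm DeiT layout, and query_stats is an illustrative helper rather than the implementation used for the figure.

import torch

@torch.no_grad()
def query_stats(model, images, block_ids=(0, 3, 6)):
    """Return {block_id: (mean, variance)} of the query values q."""
    captured = {}

    def make_hook(idx):
        def hook(module, inputs, output):
            # The qkv projection outputs (B, N, 3 * dim); the first third is q.
            dim = output.shape[-1] // 3
            captured[idx] = output[..., :dim].flatten()
        return hook

    handles = [model.blocks[i].attn.qkv.register_forward_hook(make_hook(i))
               for i in block_ids]
    model.eval()
    model(images)
    for h in handles:
        h.remove()
    return {i: (q.mean().item(), q.var().item()) for i, q in captured.items()}

# Usage with hypothetical full-precision and 4-bit DeiT-T models:
# fp_stats = query_stats(fp_deit_tiny, images)
# q4_stats = query_stats(quant_deit_tiny, images)  # e.g. variance 1.6533 vs. 1.2124 in block 0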
For ease of training, the input to the matrix multiplication layers is set to v̂, which is mathematically equivalent to the inference operations described earlier. The input activations and weights are set to 2, 3, 4, or 8 bits for all matrix multiplication layers except the first and the last, which are always kept at 8 bits. This standard practice in quantized networks has been shown to improve performance significantly. All other parameters are represented in FP32. The quantized network is initialized with the weights of a trained full-precision model with a similar architecture and then fine-tuned in the quantized space.
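A minimal sketch of this baseline setup, assuming a symmetric uniform quantizer with a straight-through estimator, is given below; QuantLinear, uniform_quantize, and the selection of the first/last layers by traversal order are illustrative simplifications rather than the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def uniform_quantize(x, bits):
    """Symmetric uniform quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    # Forward uses the quantized value; backward passes gradients straight to x.
    return x + (q - x).detach()

class QuantLinear(nn.Linear):
    """nn.Linear whose weights and input activations are fake-quantized."""
    def __init__(self, in_features, out_features, bits=4, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.bits = bits

    def forward(self, x):
        w_q = uniform_quantize(self.weight, self.bits)
        x_q = uniform_quantize(x, self.bits)
        return F.linear(x_q, w_q, self.bias)  # bias stays in FP32

def quantize_vit(fp_model, bits=4):
    """Swap Linear layers for QuantLinear, copying the full-precision weights.

    The first and last matrix-multiplication layers are kept at 8 bits;
    everything that is not a Linear layer remains FP32.
    """
    names = [n for n, m in fp_model.named_modules() if isinstance(m, nn.Linear)]
    first, last = names[0], names[-1]
    for name, module in fp_model.named_modules():
        for child_name, child in module.named_children():
            full = f"{name}.{child_name}" if name else child_name
            if isinstance(child, nn.Linear) and not isinstance(child, QuantLinear):
                q = QuantLinear(child.in_features, child.out_features,
                                bits=8 if full in (first, last) else bits,
                                bias=child.bias is not None)
                q.load_state_dict(child.state_dict())  # init from the FP32 model
                setattr(module, child_name, q.to(child.weight.device))
    return fp_model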
2.3 Q-ViT: Accurate and Fully Quantized Low-Bit Vision Transformer
Inspired by their success in natural language processing (NLP), transformer-based models have shown great power in various computer vision (CV) tasks, such as image classification [60] and object detection [31]. Pre-trained on large-scale data, these models usually have a huge number of parameters. For example, ViT-H has 632M parameters, consuming 2528 MB of memory and requiring 162 GFLOPs, which makes inference expensive in both memory and computation. This limits the deployment of these models on resource-constrained platforms. Therefore, compressed transformers are urgently needed for real-world applications.
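As a quick check of these figures, the parameter memory follows directly from 4 bytes per FP32 weight; the 4-bit number below is added only to illustrate the potential saving.

# ViT-H parameter memory: 632M weights at FP32 vs. a hypothetical 4-bit format.
params = 632e6
fp32_mb = params * 4 / 1e6    # 4 bytes per weight  -> 2528.0 MB
int4_mb = params * 0.5 / 1e6  # 0.5 bytes per weight -> 316.0 MB
print(f"FP32: {fp32_mb:.0f} MB, 4-bit: {int4_mb:.0f} MB")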
Quantization-aware training (QAT) [158] methods perform quantization during back-propagation and, in general, incur a much smaller performance drop at higher compression rates. QAT has proven effective for CNN models [159] on CV tasks. However, QAT methods remain largely unexplored for low-bit quantization of vision transformers. Therefore, we first build a fully quantized ViT baseline, a straightforward yet effective solution based on standard techniques. Our study finds that the performance drop of the fully quantized ViT comes from information distortion in the attention mechanism during the forward pass and from ineffective optimization for eliminating the distribution difference through distillation during backward propagation. First, the attention mechanism of ViT is intended to model long-distance dependencies [227, 60]. However, our analysis shows that direct quantization leads to information distortion, i.e., a significant distribution variation of the query module between the quantized ViT and its full-precision counterpart. For example, as shown
in Fig. 2.3, the variance difference is 0.4409 (1.2124 vs. 1.6533) for the first block¹. This
¹ This supports the Gaussian distribution hypothesis [qin2022bibert].